Adding median statistics #106

Open · wants to merge 1 commit into master
Conversation

@nicoborghi nicoborghi commented Oct 19, 2023

The median statistic is commonly used in fields like cosmology to report central values and confidence intervals. It sets the central point to the median of the posterior distribution, and determines the upper and lower bounds from the percentiles (e.g. 16th and 84th percentiles for a "1σ" interval).

This is a quick implementation that makes use of the summary_area attribute to compute the (symmetric) percentiles.
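For intuition, the statistic described above amounts to something like the following standalone sketch (illustrative only, not the PR's actual code; `summary_area` here is a plain float standing in for the attribute of the same name):

```python
import numpy as np

# Illustrative sketch of the median statistic: central value at the median,
# bounds at symmetric percentiles. summary_area = 0.6827 reproduces the
# familiar 16th/84th-percentile "1-sigma" interval.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 100_000)
summary_area = 0.6827

lower, center, upper = np.quantile(
    samples, [0.5 - summary_area / 2, 0.5, 0.5 + summary_area / 2]
)
print(f"x = {center:.2f} +{upper - center:.2f} -{center - lower:.2f}")
```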

@Samreay (Owner) commented Oct 19, 2023

Hi @nicoborghi, thanks for the PR! I just want to make sure I'm not missing something obvious, but isn't this the same summary statistic as SummaryStatistic.CUMULATIVE - just without any histogram or smoothing going into it?

I also don't believe we can just use np.percentile, because this doesn't cater to samples possibly having different weights
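For reference, one common way to handle unequal weights is to interpolate on the weighted empirical CDF. A minimal sketch (my own illustration, not ChainConsumer's internals):

```python
import numpy as np

def weighted_quantile(values, quantiles, weights):
    """Quantiles of `values` under sample `weights`, computed by linear
    interpolation on the weighted empirical CDF."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    # Position of each sorted sample on the weighted CDF, in (0, 1].
    cdf = np.cumsum(weights) / np.sum(weights)
    return np.interp(quantiles, cdf, values)

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 50_000)
w = np.ones_like(x)
# With uniform weights this agrees with np.quantile to sampling precision.
print(weighted_quantile(x, [0.1587, 0.5, 0.8413], w))
```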

@nicoborghi (Author) commented Oct 20, 2023

Oh yes, you are right; I had not considered the problem of weights!

I really like ChainConsumer, including the LaTeX table feature, and in cases like this I prefer to quote the parameters as the median and 16th-84th percentiles, which is less sensitive to outliers and avoids manually tuning the smoothing, binning, or KDE parameters. However, when preparing the code below I realized that in these extreme cases the plot would differ significantly from the summary statistic.

Maybe it would be helpful to offer it just for the table, or to perform the smoothing after removing the outliers?

Thanks for your time!

import numpy as np
import pandas as pd
from chainconsumer import Chain, ChainConsumer, PlotConfig

# Standard normal samples plus a far-away outlier population
arr = np.hstack([np.random.normal(0, 1, 1000), np.random.normal(1000, 1, 100)])

c = ChainConsumer()
c.set_plot_config(PlotConfig(show_legend=True))
c.add_chain(Chain(samples=pd.DataFrame(arr, columns=["x"]), name="Median",
                  statistics="median"))

d = ChainConsumer()
d.set_plot_config(PlotConfig(show_legend=True))
d.add_chain(Chain(samples=pd.DataFrame(arr, columns=["x"]), name="Cumulative",
                  statistics="cumulative"))

e = ChainConsumer()
e.set_plot_config(PlotConfig(show_legend=True))
e.add_chain(Chain(samples=pd.DataFrame(arr, columns=["x"]), name="Cumulative, smooth=0",
                  statistics="cumulative", smooth=0))

fig1 = c.plotter.plot(figsize=(5, 2))
fig2 = d.plotter.plot(figsize=(5, 2))
fig3 = e.plotter.plot(figsize=(5, 2))

[Three figures: the resulting posterior plots for the Median, Cumulative, and Cumulative (smooth=0) statistics]

@Samreay (Owner) commented Oct 21, 2023

Hmm, my main concern with adding this is that the two methods should converge, except when you have issues in your chain, as you showed in your example. The two methods differing given bad inputs isn't a worry to me.

You could also recover results that are always within an arbitrary number of significant digits if you just use cumulative and ramp the number of bins up while setting smooth=0, but I'd agree this isn't very user friendly.

I'm still happy to add this, so long as we can get it working for weighted samples, and I'd also recommend using np.quantile instead of np.percentile(100 * list_of_quantiles)
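To illustrate the suggested swap (same result, without the factor-of-100 bookkeeping):

```python
import numpy as np

samples = np.linspace(-3.0, 3.0, 101)
qs = [0.1587, 0.5, 0.8413]

# np.quantile takes quantiles in [0, 1] directly ...
a = np.quantile(samples, qs)
# ... whereas np.percentile needs them scaled to [0, 100].
b = np.percentile(samples, [100 * q for q in qs])

print(np.allclose(a, b))  # the two calls are equivalent
```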
